Causal inference is the process of determining the independent, actual effect of a particular phenomenon that is a component of a larger system. The main difference between causal inference and inference of association is that causal inference analyzes the response of an effect variable when a cause of the effect variable is changed. The study of why things occur is called etiology, and can be described using the language of causal notation. Causal inference is said to provide the evidence of causality theorized by causal reasoning.
Causal inference is widely studied across all sciences, and recent decades have seen a proliferation of innovations in the development and implementation of methodology designed to determine causality. Causal inference remains especially challenging where experimentation is difficult or impossible, which is common throughout most sciences.
The approaches to causal inference are broadly applicable across all types of scientific disciplines, and many methods of causal inference that were designed for certain disciplines have found use in others. This article outlines the basic process behind causal inference and details some of the more conventional tests used across different disciplines; this should not be mistaken as a suggestion that these methods apply only to those disciplines, merely that they are the ones most commonly used there.
Causal inference is difficult to perform, and there is significant debate among scientists about the proper way to determine causality. Despite these innovations, there remain concerns that scientists misattribute correlative results as causal, that incorrect methodologies are used, and that analytical results are deliberately manipulated in order to obtain statistically significant estimates. Particular concern is raised about the use of regression models, especially linear regression models.
Epidemiological studies employ different methods of collecting and measuring evidence of risk factors and effects, and different ways of measuring the association between the two. A 2020 review of methods for causal inference found that using the existing literature in clinical training programs can be challenging: published articles often assume an advanced technical background, may be written from multiple statistical, epidemiological, computer-science, or philosophical perspectives, methodological approaches continue to expand rapidly, and many aspects of causal inference receive limited coverage.
Common frameworks for causal inference include the causal pie model (component causes), structural causal models (causal diagrams and the do-calculus), structural equation modeling, and the Rubin causal model (potential outcomes), which are often used in areas such as the social sciences and epidemiology.
In molecular epidemiology, the phenomena studied are at the level of molecular biology, including genetics, where biomarkers serve as evidence of cause or effect.
A recent trend is to identify evidence for the influence of the exposure on molecular pathology within diseased tissue or cells, in the emerging interdisciplinary field of molecular pathological epidemiology (MPE). Linking the exposure to molecular pathologic signatures of the disease can help to assess causality. Considering the inherent heterogeneity of a given disease (the unique disease principle), disease phenotyping and subtyping are trends in the biomedical and public health sciences, exemplified by personalized medicine and precision medicine.

Causal inference has also been used for treatment effect estimation. Assume a set of observable patient symptoms (X) caused by a set of hidden causes (Z), and a treatment t that can be given or withheld; the result of giving or withholding the treatment is the estimated effect y. If the treatment is not guaranteed to have a positive effect, then the decision of whether to apply it depends first on expert knowledge of the causal connections. For novel diseases, this expert knowledge may not be available, so decisions must rely solely on past treatment outcomes. A modified variational autoencoder can be used to model the causal graph described above. While this scenario could be modelled without the hidden confounder Z, we would lose the insight that the symptoms a patient presents, together with other factors, influence both the treatment assignment and the outcome.
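A minimal sketch of such a model, assuming PyTorch, is shown below. The architecture and all names (CausalVAE, x_dim, z_dim) are illustrative simplifications of the modified variational autoencoders used in the literature, not a specific published model: an encoder approximates the posterior over the hidden confounder Z, and decoders reconstruct the symptoms X, the treatment t, and the outcome y.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalVAE(nn.Module):
    """Toy VAE for treatment effect estimation with a hidden confounder Z.
    Generative structure: Z -> X (symptoms), Z -> t (treatment), (Z, t) -> y.
    """
    def __init__(self, x_dim, z_dim=8, h=32):
        super().__init__()
        # Encoder: approximate posterior q(Z | X, t, y)
        self.enc = nn.Sequential(nn.Linear(x_dim + 2, h), nn.ReLU())
        self.mu = nn.Linear(h, z_dim)
        self.logvar = nn.Linear(h, z_dim)
        # Decoders for the generative model p(X|Z), p(t|Z), p(y|Z,t)
        self.dec_x = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))
        self.dec_t = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, 1))
        self.dec_y = nn.Sequential(nn.Linear(z_dim + 1, h), nn.ReLU(), nn.Linear(h, 1))

    def forward(self, x, t, y):
        # x: (n, x_dim); t, y: (n, 1) float tensors, t in {0, 1}
        hid = self.enc(torch.cat([x, t, y], dim=1))
        mu, logvar = self.mu(hid), self.logvar(hid)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        rec = (F.mse_loss(self.dec_x(z), x)
               + F.binary_cross_entropy_with_logits(self.dec_t(z), t)
               + F.mse_loss(self.dec_y(torch.cat([z, t], dim=1)), y))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl  # negative ELBO, up to constants

    def effect(self, x, t, y):
        # Contrast predicted outcomes under do(t=1) vs. do(t=0) at the
        # posterior mean of Z, within the observed sample.
        hid = self.enc(torch.cat([x, t, y], dim=1))
        z = self.mu(hid)
        y1 = self.dec_y(torch.cat([z, torch.ones_like(t)], dim=1))
        y0 = self.dec_y(torch.cat([z, torch.zeros_like(t)], dim=1))
        return (y1 - y0).mean()
```

Minimizing the returned loss over observational records (X, t, y) fits the generative model; the effect method then contrasts the two treatment arms while holding the inferred confounder fixed.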
Here are some of the noise models for the hypothesis X → Y with noise E:
- Additive noise: Y = F(X) + E
- Linear noise: Y = pX + qE
- Post-nonlinear: Y = G(F(X) + E)
- Heteroskedastic noise: Y = F(X) + E·G(X)
- Functional noise: Y = F(X, E)
The common assumptions in these models are:
- There are no other causes of Y.
- X and E have no common causes.
- The distribution of the cause is independent of the causal mechanism.
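Under the additive-noise model, the causal direction leaves residuals that are independent of the regressor, while the anticausal direction generally does not. The following is a minimal sketch of such a direction test, assuming scikit-learn and SciPy; the Spearman correlation between the input and the squared residuals is a crude stand-in for a proper independence test such as HSIC, and can fail on real data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import spearmanr

def anm_direction(x, y):
    """Fit y = F(x) + E and x = G(y) + E', then prefer the direction
    whose residuals look less dependent on the input."""
    def dependence(a, b):
        fit = GradientBoostingRegressor().fit(a.reshape(-1, 1), b)
        resid = b - fit.predict(a.reshape(-1, 1))
        return abs(spearmanr(a, resid ** 2)[0])  # crude dependence proxy
    return "X -> Y" if dependence(x, y) < dependence(y, x) else "Y -> X"

# Toy data generated with X as the true cause
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.tanh(x) + 0.3 * rng.normal(size=2000)
print(anm_direction(x, y))  # expected: "X -> Y"
```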
On an intuitive level, the idea is that the factorization of the joint distribution P(Cause, Effect) into P(Cause)·P(Effect | Cause) typically yields models of lower total complexity than the factorization into P(Effect)·P(Cause | Effect). Although the notion of "complexity" is intuitively appealing, it is not obvious how it should be precisely defined. A different family of methods attempts to discover causal "footprints" from large amounts of labeled data, and allows the prediction of more flexible causal relations (Lopez-Paz et al., "Towards a Learning Theory of Cause-Effect Inference", ICML 2015).
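The learning-based approach can be sketched as follows, assuming scikit-learn: generate synthetic cause-effect pairs with known direction, summarize each joint sample with a small feature vector, and train a classifier to predict the direction. The hand-picked features below are a toy stand-in for the kernel mean embeddings used by Lopez-Paz et al., not a reproduction of their method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def featurize(a, b):
    # Crude summary of the joint distribution of (a, b): correlation,
    # skewness proxies, and cubic-fit residual spread in both directions.
    return [np.corrcoef(a, b)[0, 1],
            float(np.mean(a ** 3)), float(np.mean(b ** 3)),
            float(np.std(np.polyval(np.polyfit(a, b, 3), a) - b)),
            float(np.std(np.polyval(np.polyfit(b, a, 3), b) - a))]

def synthetic_pair():
    cause = rng.normal(size=300)
    effect = np.sin(rng.uniform(1, 3) * cause) + 0.2 * rng.normal(size=300)
    if rng.random() < 0.5:
        return featurize(cause, effect), 1   # label 1: first arg is the cause
    return featurize(effect, cause), 0

pairs = [synthetic_pair() for _ in range(800)]
feats = np.array([p[0] for p in pairs])
labels = np.array([p[1] for p in pairs])
clf = RandomForestClassifier().fit(feats[:600], labels[:600])
print("held-out accuracy:", clf.score(feats[600:], labels[600:]))
```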
While much of the emphasis remains on statistical inference in the potential outcomes framework, social science methodologists have developed new tools to conduct causal inference with both qualitative and quantitative methods, sometimes called a "mixed methods" approach. Advocates of diverse methodological approaches argue that different methodologies are better suited to different subjects of study. Sociologist Herbert Smith and political scientists James Mahoney and Gary Goertz have cited the observation of Paul W. Holland, a statistician and author of the 1986 article "Statistics and Causal Inference", that statistical inference is most appropriate for assessing the "effects of causes" rather than the "causes of effects". Qualitative methodologists have argued that formalized models of causation, including process tracing and fuzzy set theory, provide opportunities to infer causation through the identification of critical factors within case studies or through a process of comparison among several case studies. These methodologies are also valuable for subjects in which a limited number of potential observations or the presence of confounding variables would limit the applicability of statistical inference.
On longer timescales, persistence studies use causal inference to link historical events to later political, economic, and social outcomes.
Despite the difficulties inherent in determining causality in economic systems, several widely employed methods exist within economics and related fields.
Model specification can be useful in determining causality that is slow to emerge, where the effects of an action in one period are only felt in a later period. It is worth remembering that correlation only measures whether two variables co-vary, not whether one affects the other in a particular direction; thus, one cannot determine the direction of a causal relation from correlations alone. Because causal acts are believed to precede causal effects, social scientists can use a model that looks specifically for the effect of one variable on another over a period of time: variables representing earlier phenomena are treated as treatments, and econometric tests look for later changes in the data attributable to them. A meaningful difference in outcomes following a meaningful difference in treatment may indicate causality between the two (e.g., Granger-causality tests). Such studies are examples of time-series analysis.
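A Granger-causality test asks whether lagged values of one series improve forecasts of another beyond what the second series' own lags provide. A minimal sketch, assuming statsmodels and pandas (the series names and the simulated lag structure are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 300
policy = rng.normal(size=n)
# 'output' responds to 'policy' with a one-period lag, plus noise.
output = 0.8 * np.roll(policy, 1) + 0.5 * rng.normal(size=n)

df = pd.DataFrame({"output": output, "policy": policy})
# Tests whether lags of the second column help predict the first column.
grangercausalitytests(df[["output", "policy"]], maxlag=2)
```

Small p-values in the reported F-tests are consistent with 'policy' Granger-causing 'output'; as with all such tests, this shows predictive precedence, not mechanism.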
A chief motivation for sensitivity analysis is the discovery of confounding variables: variables that have a large impact on the results of a statistical test but are not the variables that the causal inference is trying to study. Confounding variables may cause a regressor to appear significant in one implementation but not in another.
However, there are limits to the ability of sensitivity analysis to prevent the deleterious effects of multicollinearity, especially in the social sciences, where systems are complex. Because it is theoretically impossible to include or even measure all of the confounding factors in a sufficiently complex system, econometric models are susceptible to the common-cause fallacy, where causal effects are incorrectly attributed to the wrong variable because the correct variable was not captured in the original data. This is an example of the failure to account for confounding.
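The common-cause fallacy can be illustrated with a small simulation, assuming statsmodels (all variable names are illustrative): an unmeasured variable drives both the regressor and the outcome, so a regression that omits it makes the regressor appear highly significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
confounder = rng.normal(size=n)        # unmeasured common cause
x = confounder + rng.normal(size=n)    # x has no direct effect on y
y = 2 * confounder + rng.normal(size=n)

# Omitting the confounder: x looks highly significant.
print(sm.OLS(y, sm.add_constant(x)).fit().pvalues)

# Controlling for the confounder: x's coefficient collapses toward zero.
controls = sm.add_constant(np.column_stack([x, confounder]))
print(sm.OLS(y, controls).fit().pvalues)
```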
Separate from the difficulties of causal inference, the perception that large numbers of scholars in the social sciences engage in non-scientific methodology exists among some large groups of social scientists. Criticism of economists and social scientists for passing off descriptive studies as causal studies is rife within those fields.
One prominent example of common non-causal methodology is the erroneous assumption that correlative properties are causal properties. There is no inherent causality in phenomena that correlate. Regression models are designed to measure variance within data relative to a theoretical model: nothing suggests that data exhibiting high levels of covariance have any meaningful relationship, absent a proposed causal mechanism with predictive properties or a random assignment of treatment. The use of flawed methodology has been claimed to be widespread, with common examples of such malpractice being the overuse of correlative models, especially regression models and particularly linear regression models. The presupposition that two correlated phenomena are inherently related is a logical fallacy known as spurious correlation. Some social scientists claim that the widespread use of methodology that attributes causality to spurious correlations has been detrimental to the integrity of the social sciences, although improvements stemming from better methodologies have been noted.
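A classic demonstration of spurious correlation regresses two independent random walks on each other: despite there being no relationship by construction, the fit often looks strong. A short sketch assuming numpy and statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
walk_a = np.cumsum(rng.normal(size=500))  # two independent random walks:
walk_b = np.cumsum(rng.normal(size=500))  # no causal link by construction

fit = sm.OLS(walk_a, sm.add_constant(walk_b)).fit()
print(f"R^2 = {fit.rsquared:.2f}, p = {fit.pvalues[1]:.2g}")
# A high R^2 and tiny p-value here reflect shared trending behavior
# in nonstationary data, not any causal relationship.
```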
A potential effect of scientific studies that erroneously conflate correlation with causality is an increase in the number of scientific findings whose results are not reproducible by third parties. Such non-reproducibility is a logical consequence of findings of mere correlation being overgeneralized into mechanisms that have no inherent relationship, where new data do not contain the idiosyncratic correlations of the original data. Debates over the effect of malpractice versus the effect of the inherent difficulties of searching for causality are ongoing. Critics of widely practiced methodologies argue that researchers have engaged in statistical manipulation in order to publish articles that supposedly demonstrate evidence of causality but are actually examples of spurious correlation being touted as evidence of causality: such endeavors may be referred to as data dredging. To prevent this, some have advocated that researchers preregister their research designs prior to conducting their studies, so that they do not inadvertently overemphasize a nonreproducible finding that was not the initial subject of inquiry but was found to be statistically significant during data analysis.